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Abstract 

A challenging problem when studying a dynamical system is to 
find the interdependencies among its individual components. Several 
algorithms have been proposed to detect directed dynamical influences 
between time series. Two of the most used approaches are a model- 
free one (transfer entropy) and a model-based one (Granger causality). 
Several pitfalls are related to the presence or absence of assumptions 
in modeling the relevant features of the data. We tried to overcome 
those pitfalls using a neural network approach in which a model is built 
without any a priori assumptions. In this sense this method can be 
seen as a bridge between model-free and model-based approaches. The 
experiments performed will show that the method presented in this 
work can detect the correct dynamical information flows occurring in a 
system of time series. Additionally we adopt a non-uniform embedding 
framework according to which only the past states that actually help 
the prediction are entered into the model, improving the prediction and 
avoiding the risk of overfitting. This method also leads to a further 
improvement with respect to traditional Granger causality approaches 
when redundant variables (i.e. variables sharing the same information 
about the future of the system) are involved. Neural networks are also 
able to recognize dynamics in data sets completely different from the 
ones used during the training phase. Q 


1 Introduction 

A fundamental problem in the study of dynamical systems is how to find 
the interdependencies among their individual components, whose activity is 
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recorded and stored in time series. Over the last few years, considerable ef¬ 
fort lias been dedicated to the development of algorithms for the inference of 
causal relationships among subsystems, a problem which is strictly related 
to the estimate of the information flow among subsystems oca- Two ma¬ 
jor approaches to accomplish this task are Granger causality (GC) [3], 0] and 
transfer entropy (TE) [5|. GC is based on regression, testing whether a source 
variable (driver) is helpful to improve the prediction of a destination vari¬ 
able (target) beyond the degree to which the target predicts its own future. 
GC is a model-based approach, implying that the corresponding statistics 
for validation can be derived from analytic models, resulting in a fast and 
accurate analysis. A pitfall, however, is inherent to model-based approaches: 
the model assumed to explain the data often implies strong assumptions and 
the method is not able to detect the correct directed dynamical networks 
when these assumptions are not met. On the other hand non-parametric 
approaches, such as transfer entropy, allow the pattern of influences to be 
obtained in the absence of any guidance or constraints from theory; the 
main disadvantages of non-parametric methods are the unavailability of an¬ 
alytic formulas to evaluate the significance of the transfer entropy and the 
computational burden, typically heavier than those required by model-based 
approaches. 

Feed-forward neural networks, consisting of layers of interconnected arti¬ 
ficial neurons p|, are among the most widely used statistical tools for non- 
parametric regression. Relying on neural networks, the proposed approach 
to Granger causality will be both non-parametric and based on regression, 
thus realizing the Granger paradigm in a non-parametric fashion. 

In this paper we address the implementation of Granger’s original defi¬ 
nition of causality in the context of the artificial neural networks approach 
[7]. The metrics used to validate the hypothesis of directed influence is the 
prediction error : the difference between the network output and the ex¬ 
pected target. The choice of the correct prediction error, and consequently 
the choice of the past states of the time series that will be fed to the model, 
has to be accompanied by a validation phase. Only under optimization of 
the generalization error one can be sure that the network is not overfitting. 

In order to deal with an increasing number of inputs, each one repre¬ 
senting a specific candidate source of directed influence, we will adopt a 
non-uniform embedding procedure (8j that is an iterative procedure to select 
only the most informative past states of the system to predict the future 
of the target series among a wider number of available past states. In line 
with this procedure the network will be trained with an increasing number 
of inputs, each of them representing a precise past state of the variables that 
are most helpful to predict the target. Also this selection process will be 
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implemented using the notions of prediction error and generalization error, 
the former quantifying how well the training data are reproduced, the latter 
describing the goodness of the validation on a novel set of data. 

It is worth stressing that a neural networks approach to GC has been al¬ 
ready proposed in |9], where neural networks with a fixed number of inputs, 
together with other estimators of information flow, are used to evaluate GC. 
In J9] neural networks are trained without a validation set and an empiri¬ 
cal method to avoid overfitting is adopted. To our knowledge the present 
approach is the first time that non-uniform embedding and a regularization 
strategy by a validation set are used together in the context of neural network 
approaches to detect dynamic causal links. Moreover, the neural networks 
built by our approach will accomplish not only the task of estimating informa¬ 
tion flows among variables, they may also be used for dynamic classification 
task as well, as better explained in Subsection 16.61 The new method pre¬ 
sented in this work has been integrated in MuTE MATLAB toolboj^ [TO] 
and it will be compared here with the linear Granger causality as well as 
with the Transfer Entropy, both implemented in the non-uniform embedding 
framework. 


2 Introduction to neural networks 

Artificial neural networks (ANN) are a very popular branch of machine learn¬ 
ing. Here we give a brief introduction to neural networks to make this work 
self-consistent. 

Neural networks can be represented as oriented graphs whose nodes are 
simple processing elements called “neurons” handling their local input, con¬ 
sisting of a weighted summation of the outputs from the parents nodes [6]. 
The input signal is processed by means of a function, called “activation func¬ 
tion”, and the corresponding outcome, called “output”, is then sent to the 
linked nodes by a weighted connection; the weight is a real number that 
represents the degree of relevance of that connection inside the neural net¬ 
work. The most common architecture of a neural network consists of neurons 
ordered into layers. The first one is called “input layer” that receives the ex¬ 
ternal inputs. The last layer is called “output layer” that gives the result of 
the computations made by the whole network. All the layers between the 
input and output layer are called “hidden layers”. 

Neural networks with at least one hidden layer and activation functions 
as the sigmoid function on the hidden nodes are able to adequately approx- 

2 MuTE is a freeware toolbox. A detailed explanation is available on 
www.mutetoolbox.guru 
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imate all kinds of continuous functions defined on a compact set from a 
d-dimensional input space M. d , the domain, to a c-dimensional output space 
R c , the codomain given a sufficient number of hidden nodes: in this sense one 
can say that neural networks can perform any mapping between two different 
vector spaces [7]. In order to allow a neural network to find the correct map¬ 
ping, a so-called “learning phase” is needed. In this work we use supervised 
learning, during which inputs are presented to the network and its output is 
compared to a known output. The weights are adjusted by the network that 
tries to minimize a cost function that depends upon the network output and 
the known output. This kind of learning allows a network to discover hidden 
patterns inside the data. 

In this work we implement a growing neural network to study dynamical 
interactions in a system made up of several variables, described by time 
series, interacting with each other. The aim of the work is not only to find 
a directional relationship of influence between a subset of time series, the 
source , and a target time series taking into account the rest of the series 
collected in a set, called conditioning , but also to determine the delay at 
which the source variables are influencing the target. We will then see how 
the neural networks approach can be useful to accomplish, under the same 
framework, several tasks such as: finding the directed dynamical influences 
among variables chosen at a certain delay; predicting a target series when 
the network is fed with a novel realization of a dynamical system whose 
connectivity structure has been previously learned; classifying a new data 
set, giving information about how close the causal relationships are to those 
observed in data sets used during the learning phase. 

2.1 Mathematical framework 

In this work we deal with growing feed-forward neural networks to better infer 
the directed dynamical influences in a composite system. Each stochastic 
variable at hand is assumed to be zero mean (the mean of the data sample 
is subtracted from data), hence we will deal with neural networks without 
bias terms. A classical feed-forward neural network without bias is usually 
described by: a finite set of 0 nodes S = 1,2 ,,0 divided in d inputs, c 
output nodes and O — (d + c) hidden nodes; a finite set of one way direction 
connections C each one connecting a node belonging to the k-tli layer to a 
node belonging to the h-tli layer, with h > k. A weight Whk is associated 
with each connection from the node k E k to the node h E h. Each node 
o is characterized by an input function s a , an input value i 0 , an activation 
function f a , and an output value z a [7J. Let us now define as the weights 
vector of the connections which leave the nodes of the k-th layer and reach 
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the node h. Let us define z as the output vector of a generic layer of nodes. 
The input i h is given by i h = s h (w h ,z). Usually we have: i h = J2k w hk ■ z k . 
Each Zh is given by 


z h 



( 1 ) 


To evaluate the output of a multilayer network, consecutive applications 
of ([Tj) are needed to activate all network nodes. Figure [l]depicts the schematic 
structure of the feed-forward networks under discussion here. 



Figure 1: Schematic representation of a feed-forward neural network. 

To summarize: the output values of a network can be expressed as de¬ 
terministic functions of the inputs. Assuming that the network has only 
one hidden layer h, we can say that the whole network represents a func¬ 
tion, linear or non-linear depending on the linearity or non-linearity of the 
fh , between the d-dimensional input space and the c-dimensional output 
space, with parameters w given by the network weights. The relation hold¬ 
ing between inputs and outputs of the network can be approximated in the 
multidimensional space spanned by the hidden nodes by either an hyper¬ 
plane if linear functions are used as activation functions of the hidden nodes 
or by a smoother approximation when non-linear functions are set as activa¬ 
tion functions of the hidden nodes. Usually, the activation functions of the 
output nodes are set to be the identity function that does not modify the 
input values of the output nodes because the outputs of the network are not 
supposed to be bounded in order to assume values as close as possible to the 
training target values, assuming that overfitting is avoided. 
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So far we have shown how neural networks can process inputs and how 
they can be mapped onto a parametric function J-(x; w) : —y M c . 

We can now assume that there is a function / : x G W l —* /(x) G M c to be 
modelled and we know a finite set of N couples (x n , t n ), where n G [1,..., N], 
t n is the value of the function /(x) evaluated in x n plus an error e(x n ). We 
want to approximate / using the parametric function T : w G M p , x G —» 
^(xjw) G M c . The function T can be found through the minimization of 
a certain error function E( w). For instance a classical error function is the 
sum of squares function ([2]) to minimize by means of an iterative procedure 
that requires the data to be presented to the network several times through 
consecutive realizations called epochs'. 

^ N d 

E = 2 £ E w - *n 2 P) 

n=l k =1 

The training of the network is the process to determine, starting from a 
finite set of couples (x n , t n ), the weights w that can better shape T to be as 
close as possible to /. After each epoch of the training phase the weights in 
the network are adjusted. At this point a definition of close is in order. Let 
us suppose a noisy dataset consisting of x n and t n = /(x n ) + e(x n ) where 
e is the noise term. If we train the network until the input can be exactly 
reproduced then T is not only reproducing /, but the noise too. It is easy 
to understand that the more specialized the network the less it will be able 
to predict the right t n = /(x n ) + e(x n ) when a x n never seen before is 
presented to the network. In this case we say that the network is not able to 
generalize. To overcome this issue the validation phase is embedded in the 
learning. The validation phase is paramount because it allows the network to 
both model the function from which the data could have been drawn and to 
avoid modeling fluctuations produced by noise in the training set. In order 
to accomplish these two modeling tasks at the same time, the whole learning 
procedure is divided into two well distinguished steps: 

1. the whole data set is divided in two groups. One group is used for the 
training step during which the weights are updated 

2. the second group is used for the validation step. These data have not 
been used in the previous step. We used a maximum amount of training 
epochs and a smaller number of epochs called validation epochs. The 
validation phase is embedded in the learning phase: this combination of 
training and validation avoids erroneous use of the training procedure, 
thus avoiding overfitting. 
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In the following section, we present the algorithm used for the learning 
phase: 

1. training step: adjust the weights after a number of training epochs 

2. validation step: evaluate the generalization error and store it in a vector 
VEvect 

3. repeat steps 1. and 2. continuing to train the network until one of the 
following three stop conditions is verified: 

• the relative error evaluated as 

|| current VEvect entry — mean (previous VEvect entries) || 
current VEvect entry 

is less than a validation threshold set to 10~ 3 . The value of previ¬ 
ous VEvect is set to 5; 


(current VEvect entry — mean (previous VEvect entries)) 
current VEvect entry 


> 0 


• the maximum number of training epochs is reached. 


The previous VEvect and validation threshold values have been chosen taking 
into account a cautious gradient descent implying small updating steps as 
the main concern here is the risk of overfitting. 


3 Granger Causality with neural networks 

The aim of this work is to find directed dynamical influences among vari¬ 
ables, modeled as time series, using neural networks as a powerful tool to 
compute the prediction errors needed to evaluate causality in the Granger 
sense. According to the original definition, Granger causality (GC) deals with 
two linear models of the present state of a target variable. The first model 
does not include information about the past states of a driver variable, while 
the second model contains such information. If the second model’s error is 
less than that of the first model in predicting the present state of the target, 
then we can safely say that the driver is causing the target in the sense of 
Granger m Here we introduce a new Granger causality measure called 
neural networks Granger causality (NNGC) defined as 

A A GC = err re duced — err full (^) 
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where err re duced the prediction error obtained by the network that does 
not take into account the driver’s past states, while er ' r 'f u ll ^ ie prediction 
error evaluated by the network that takes into account the driver’s past 
states. 

Therefore, instead of fitting predefined models, (linear ones in the origi¬ 
nal proposal by Granger) we train a neural network to estimate the target 
using only the past states that can better explain the target series, by using 
the non-uniform embedding technique. Such strategy leads to growing neu¬ 
ral networks, with an increasing number of input neurons, each input neuron 
representing a past state chosen from the amount of past states available, con¬ 
sidering all the variables in the system. The architecture of the network and 
choice of the most suitable past states, performed through the non-uniform 
embedding approach, are described in detail in the next sections. Relying 
on neural networks, this method realizes the Granger paradigm in a non- 
parametric fashion, like in [ 121 IT3lj where radial basis function networks where 
employed. This article improves such previous work by (i) using non-uniform 
embedding and (ii) employing training and validation phases concurrently to 
ensure a more robust detection of dynamical interactions. 


4 Non-uniform embedding (NUE) 

We first introduce in this section the NUE approach which is the basis of the 
algorithm used to build a neural network able to find the correct mapping 
between the input and the output spaces in an optimal way. The uniform 
embedding (UE) approach relies on a predefined set of candidates making a 
strong assumption about which past states are better able to explain the fu¬ 
ture of the target series. This approach, lacking a specific criterion according 
to which the candidates are chosen, is likely to cause problems such as over¬ 
fitting and detection of false influences mm- NUE framework, instead, is 
an iterative procedure aimed at detecting only the time series’ past states 
that can effectively help to predict the target series. To evaluate whether 
a new candidate should be chosen, an hypothesis and, eventually, a signifi¬ 
cance test, should be satisfied. In this way of exploring causality, once this 
hypothesis and significance test (when needed) are no longer satisfied, the 
procedure is unable to find additional candidates to help predict the target. 

Let us consider a composite system described by a set of M interacting 
dynamical (sub) systems and suppose that, within the composite system, 
we are interested in evaluating the information flow from the source system 
X to the destination system y , collecting the remaining systems in the vec¬ 
tor Z = \Z k } k=l M _ 2 - We develop our framework under the assumption of 


stationarity, which allows to perform estimations replacing ensemble averages 
with time averages (for non-st at ionary formulations see, e.g., |16] and refer¬ 
ences therein). Accordingly, we denote X, Y and Z as the stationary stochas¬ 
tic processes describing the state visited by the systems A, y and Z over 
time, and X n , Y n and Z n as the stochastic variables obtained sampling the 
processes at the present time n. Moreover, we denote X~ = [X n _iX n _ 2 ...], 
Y~ = [Y n _iY n _ 2 ...], and Z~ = [Z n _ 1 Z n _ 2 ...] as the infinite-dimensional 
vector variables representing the whole past of the processes X, Y and Z. In 
some cases, taking the instantaneous influences of the candidate drivers into 
account as well may also be desirable. In such cases, the vectors X~ and Z~ 
defined above should also contain the present terms X n and Z n . 

We will discuss here the crucial issue of how to approximate the infinite¬ 
dimensional variables representing the past of the processes. This problem 
can be seen in terms of performing suitable conditioned embedding of the 
considered set of time series m 

The main idea is to reconstruct the past of the whole system represented 
by the processes X , Y, Z with reference to the present of the destination 
process Y, in order to obtain a vector V = [V r ) , V*, V^] containing the most 
significant past variables to explain the present of the destination. 

Non-uniform embedding constitutes the methodological advance with re¬ 
spect to the state of the art that we propose as a convenient alternative to 
UE. This approach is based on the progressive selection, from a set of candi¬ 
date variables including the past of X, Y, and Z considered up to a maximum 
lag ( candidate set), of the lagged variables which are more informative for 
the target variable Y n . At each step, selection is performed maximizing the 
amount of information that can be explained about Y by observing the vari¬ 
ables considered with their specific lag up to the current step. This results in 
a criterion for maximum relevance and minimum redundancy for candidate 
selection, so that the resulting embedding vector V = \V^ V^} includes 

only the components of X~, Y~ and Z~, which contribute most to the de¬ 
scription of Y n . Starting from the full candidate set, the procedure which 
prunes the less informative terms is described below: 

1. Get the matrix with all the candidate terms 
MC = [X n _i.. .X n 

—Ix^n —1 • • • ly^n-i ■ ■ ■ Z n _j z ], with l x , W, lz rep¬ 
resenting the maximum lag considered for the past variables of the 
observed processes; these matrices will contain also the terms X n and 
Z n in case one wants to take into account instantaneous effects. The 
values of lx, W, lz can be set by the experimenter according to a known 
feature of the data, or set to a reasonably large value for exploratory 
purposes. If values of lx, W and lz are set too low, an incorrect esti- 
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mation of Granger causality may result, but higher values should not 
issues with non-uniform embedding. 

2. Run the procedure to select the most informative past variables and 
the optimal embedding vector: 

(a) Initialize an empty embedding vector Vn 

(b) Perform a while loop on k, where k can assume values from 1 to 
the number of initial available candidates, numC , in the MC ma¬ 
trix. At the k— th iteration, after having chosen k — 1 candidates 
collected in the vector Vn k ~ 1 ' ) : 

for 1 < i < number of current candidate terms 

• add the i— th term of MC , Wn \ to a copy of Vn fe_1 ^ to form 
the temporary storage variable Vf = [WnVn k 

• compute the information exchanged between Y n and Vf 

(c) Among the tested Wn \ select the term W n which maximizes the 
information exchanged 

(d) if W n satisfies a termination criterion , delete it from MC and set 
k — k + 1. 

(e) else end the procedure setting k = numC + 1 and returning V = 

T/(k — 1) 
v n 

3. Use Y n and the full embedding vector V = [V r f V 7 f] and to evalu¬ 
ate err'| u u . err, rcc l ucc( } is obtained excluding from err pj]| the candi¬ 
dates belonging to the variables considered as drivers. Both errors are 
evaluated as the root mean squared error (RMSE) between the neu¬ 
ral network output and Y n . If the error resulting from the network 
that contains the inputs representing the driver’s past states (errpqj) 
is lower than the error resulting from the network that does not take 
into account the driver’s past states ( erT re duced)’ th en tl ie driver as¬ 
sessed is determined to help predict the target more than the network 
that excludes the driver. 

The complexity of the algorithm concerns mainly step 2, in particular 
step 2(b), involving a for loop nested inside a while loop: in the worst case 
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the body of the for loop is executed numC 2 times resulting in a complexity 
0{numC 2 ). 

Summarizing, the non-uniform embedding is a feature selection technique 
selecting, among the available variables describing the past of the observed 
processes, the most significant - in the sense of predictive information - for 
the target variable. Moreover, given the fact that the variables are included 
into the embedding vector only if associated with a significant contribution 
to the description of the target, the significance of the NNGC estimated with 
the NUE approach results simply from the selection of at least one lagged 
component of the source process. In other words, if at least one component 
from X is selected by NUE, the estimated NNGC is strictly positive and can 
be assumed to be significant. If not, the estimated NNGC is exactly zero 
and is assumed to be non-significant. This latter also occurs when the first 
candidate (k = 1) does not reach the desired level of significance, meaning 
that none of the candidates provides significant information about the target 
variable. This may also be enamntered, for instance, when the target process 
consists of white noise: the code will return an empty embedding vector and 
assign a zero value to the NNGC. 


5 Non-uniform embedding using neural networks 
(NeuNet NUE) 

Here we want to investigate the opportunity to use neural networks to create 
the two models needed to evaluate NNGC and, at the same time, to better 
choose the right candidates to be considered as terms of the models. In this 
sense our method is a model-free approach because we do not assume any 
model a priori that can explain the data, but we allow the network to explore 
the parameters space in order to find the model we need. The procedure will 
be able to model a function from the input space, spanned by the time series’ 
past states, and the output space, spanned by the present state of the target 
series: Y n = f(V). It will be possible to estimate a function T as close as 
possible to /. This will ensure a precise prediction of Y from Y itself, X and 
Z. It is easy to see that from T it is possible to assess whether for another 
data set Y',X', Z', the same relation T holds: in this case the network will 
be able to generalize. 

In this study a three-layers feed-forward neural network is used [7], trained 
by means of the resilient back propagation technique that is one of the fastest 
learning algorithms ptE]. Briefly, the resilient back propagation is an opti¬ 
mized algorithm to update the weights of a neural network based on the 
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gradient descent technique. Let A ?J - be the weight update value that only 
determines the size of the weight update and E the error function. Then the 
resilient back propagation rule is the following: 




r/ + x 1} 

T X A<H 

Ah- 1 ) 


9e h-b 

dwij 

if 9E h-b 

dw^ 


be h-b 

dwij 

dE h-b 

dwij 


> 0 
< 0 


else 


( 4 ) 


where 0 < 77 “ < 1 < r) + . To summarize: every time the partial derivative of 
the current weight changes in sign, i.e. the error function slope changes 
indicating that a local minimum has been avoided, the updated value A^ is 
decreased by the factor rj~ allowing a reversal, or “coming back”, in the pa¬ 
rameters space towards the local minimum. If the derivative does not change 
sign, then the updated value A 7J is increased by the factor r/ + accelerating 
towards a local minimum. 

Once the updated value is evaluated, the weight update is quite straight¬ 
forward as shown by the following equations: 



(t) 


-A' ■ 

*7 


+A 


h) 


*? 


if 9E h) 

dwij 

dE (*) 
dw ij 

else 


> 0 
< 0 


( 5 ) 


so that = wf) + Awl . However, we should also take into account the 

LJ LJ ij 1 

case when the partial derivative changes sign: the previous weight update is 
then reverted as follows: 


A 


(t) 

Wij 


-A 


O-i) 

Wij 


if dE ft-!) 
dw^ 


X 


dE ft) 
dw^ 


< 0. 


( 6 ) 


Following the NUE scheme, each input corresponds to a candidate, while 
the minimization criterion is the prediction error between the network output 
and Y n . We should keep in mind that the core of the entire procedure lies 
in the choice of the candidates that can actually help to predict the target 
series. Once the relative prediction error, defined as (prediction error fc _ 1 — 
prediction error fc )/(prediction error 1 — prediction error fc ) where k can assume 
values from 1 to the number of initial available candidates, is greater than 
or equal to a threshold, the procedure stops and no further candidates are 
chosen. To summarize: the hypothesis of Granger causality evaluates how 
much information is introduced by adding a new input with respect to the 
information carried only by the inputs previously considered. Moreover, it 
is worth stressing that in this case we do not rely on the comparison with 
a null distribution in order to choose whether a given candidate must be 
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chosen or not. On the other hand, when a driver-response relationship among 
variables holds, the algorithm will find, input by input, the candidate that 
will give the lowest prediction error, this being a condition that can hold 
only if we ensure the network is not overfitted. The risk of overfitting is the 
reason why a validation phase, described in detail in the following sections, 
was implemented and the idea of a fixed amount of training iterations was 
discarded. As soon as the error on the validation set, called generalization 
error increases, the training of the network stops ensuring the capability of 
the network to generalize, implying that it has not been overfitted. 

To better explain the steps implemented to select the past states as a 
pseudo-code we can say that a for loop is nested within a while loop. The 
while loop condition, that takes into account the decreasing of the prediction 
error during the training phase, determines whether or not an additional 
input should be added to the network. It is worth stressing that during the 
whole procedure of the candidates’ selection, the internal architecture of the 
network is kept fixed: the number of hidden nodes is set up as a fraction 
of the maximum number of candidates available, as shown in subsection 
m and it does not change. The for loop, instead, tests each available 
candidate given the previous inputs already chosen. During this test, at each 
iteration of the for loop, a network is trained taking into account the current 
candidate and the validation phase takes place according to the procedure 
explained in Subsection 12.11 point [3J Therefore, the validation error is taken 
into account in order to allow the network itself to reach its best performance, 
in terms of the generalization task, according to the current candidate. At 
the end of the for loop the candidate which gives the minimum prediction 
error is selected. If the prediction error satisfies the while loop condition, such 
that the relative prediction error is smaller than a threshold, the candidate 
is chosen and deleted from the set of the available candidates so that the 
procedure can continue. Otherwise, the procedure will stop. The pseudo¬ 
code of the algorithm is shown in the following: 

1: Initialize network parameters; 

2: Initialize the embedded matrix EM = 0 
3: Initialize the prediction error PE vector = 0 
4: Initialize the current prediction error CPE vector = 0 
5: Initialize final candidate matrix FCM = 0 

6. Initialize CS — —i • • • X n — i x , —i... Y n — i Yi —i... — iz\ m The terms X n and 

Z„ should be considered too in case the instantaneous effects should be taken into 
account. 

7: k = 1 

8: while CS ^ 0 do 
9: if CS is full then 

10: Initialize the network NN with one input, the number of chosen hidden nodes, 

one output 
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11: else 

12: Add to NNfc_i another input; 

13: Initialize only the weights between the new input and the hidden nodes keeping 

all the rest fixed; 

14: end if 

15: for i 6 [1, ..., length(CS)] do 

16: while epoch < maxTrainEpochs do 

% Learning phase: 

17: train the network, after 30 training epochs evaluate the prediction error; 

18: validate the network evaluating the generalized error; 

19: if epochs/valEpochs == 0 and epochs < maxTrainEpochs then 

20: evaluate the relative validation error 

21: if ||relative validation error|| < validationThreshold or relative error > 0 

or epochs == maxTrainEpochs then 
22: Store the prediction error in CPE(i) 

23: epoch = maxTrainEpochs + 1 

24: end if 

25: Store the prediction error in CPE(i); 

26: epoch = epoch + 1 

27: end if 

28: end while 

29: end for 

30: NNfc = neural network having in input the candidates that give the minimum 

prediction error stored in CPE 
31: PE fc = min(CPE) 

32: if relative prediction error < trainThreshold then 

33: NN = NN t _i 

34: PE = PE fc _i 

35: CS = 0 

36: else 

37: NN = NN fc 

38: add to FCM the candidate of CS that returns the minimum prediction error 

39: delete from CS the candidate that returns the minimum prediction error 

40: k = k + 1 

41: end if 

42: end while 

43: return NN; FCM; PE 

where the strings after the % symbol should be considered as non executable 
code. 

In the following we will explain in more detail the algorithm showed above: 

1. In the initialization phase it is worth noting that lx, W, lz represent 
the maximum lag considered for the past variables of the observed 
processes. In the following experiments we will set lx, W, lz to take 
into account more past states than needed. 

2. at the k- th step of the while loop at line [8], where k runs on the number 
of inputs chosen, the network tests all the candidates available by means 
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of the for loop at line [13 there are k inputs. The first k -1 inputs are 
the ones chosen so far and on the k-th. input one candidate per time is 
considered. The initial conditions are the same for each candidate: the 
weights have been fixed so the ones departing from the k -1 inputs are 
the same found as the result of the training at the (A;-l)-th step and the 
weights departing from the k -th input are the same at the beginning of 
each training session when the RMSE between the network output and 
Y n is evaluated. Lines mm take care of whether to stop the training 
phase for the current candidate according to the generalization error. 

3. lines I32HTT1 check whether it would be worth adding candidates, or it 
is better to stop the whole procedure because no further meaningful 
information can be added to better predict the target. The generaliza¬ 
tion error is not relevant at this stage, since it is only used to stop the 
training phase. 

The network is finally trained to reproduce the best correspondence between 
the space spanned by the terms of FCM and the space spanned by Y n . The 
network is then the model that can be used to explain Y n , including the 
driver’s candidates. This model will give err^jj. To evaluate erT re d ucec i the 
candidates belonging to the source system should be removed from the net¬ 
work, so the corresponding inputs and weights should not be considered. This 
configuration leaves the weights between the hidden nodes and the output 
unchanged, so the network now won’t be able to approximate Y n as well as 
during the evaluation of the previous term if the causal hypothesis holds be¬ 
tween the driver and the target. err re duced can com P u ted projecting the 
information carried by the inputs representing the candidates belonging to 
the target and the conditioning variables on the output space and evaluating 
the RMSE between the network output and Y n . NNGC is now immediately 
evaluated by the difference between the two terms. The significance of the 
causality measure estimated with the neural network method embedded into 
the NUE approach results simply from the selection of, at least, one lagged 
component of the driver. In other words, if at least one component from 
the driver is selected, the Granger causality is strictly positive and can be 
assumed as significant. If this is not the case, the estimated causality that 
results is exactly zero and is assumed to be non-significant. 

NUE is used here as a feature selection algorithm. Other feature selection 
algorithms can be used to select the most informative candidates; in the 
present work our choice is in line with other approaches to detect dynamical 
interactions present in literature, thus offering a coherent framework for all 
the estimators. 
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6 Applications to simulated data 

Before applying the proposed method, the correct initialization parameters 
were set (see Subsection E3D - First of all, we wanted to prove that neu¬ 
ral networks implemented with the non-uniform embedding framework per¬ 
form better than neural networks implemented with the uniform embedding 
framework, Subsection 16.21 Then, several datasets were simulated to test Ne- 
uNetNUE in different situations in order to explore its capability to detect 
the correct directed dynamical links, see Subsections 16.31 - 16.41 During those 
three experiments we compared the neural networks with a model-based ap¬ 
proach and two model-free approaches, as described in Subsection 16.31 to get 
a better idea of the performances obtained by our method. Furthermore, we 
wanted to check whether NeuNetNUE was both robust with respect to redun¬ 
dant information (see Subsection 16.511 and able to outperform an approach 
based on multivariate Granger causality analysis JT9] . Finally we wanted to 
evaluate the capability of the networks to predict and classify time series (see 
Subsection 16.61) . 

6.1 Choice of the parameters 

One of the crucial aspects of neural networks approaches concerns the choice 
of the optimal parameters. In this paper we are interested not only in the 
parameters involving the architecture and the training of the network, but 
also in the parameters that are responsible for the number of past states that 
can be chosen, allowing the approach to be more or less conservative. The 
parameters can be listed as follows: 

• the threshold according to which a certain number of past states are 
chosen (th). This parameter is taken into account to stop the training 
of the network and it consequently regulates the amount of past states 
chosen by the network: for a lower th more past states are selected, see 
Section [5] pointed list [3] 

• the validation threshold, useful to not overfit the network (valTli). This 
parameter plays an important role in the validation phase, allowing the 
network not to be overfitted, see Section [5] pointed list [2] 

• the number of hidden nodes (hidNodes), reported as percentage of the 
total amount of the available past states 

• the learning rates for the resilient back propagation, r] + 1 r]~. 
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The parameters mentioned above must be set so that the neural network 
approach is able to detect the expected information flow. The investigation 
of the best parameters values was performed on linear and non-linear models 
with memory up to 3 points in the past. 

We considered 20 simulations of the systems for each combination of the 
parameters values shown in table Q] We set lx = ly — lz = 5. Values 
of rj- and r] + , the parameters of the resilient back propagation, ranged as 
r)~ E [0.4,..., 0.9] with step of 0.1 and r] + E [1.1,..., 1.4] with step of 0.1. 
The threshold’s values for the prediction error range between 2 -8 and 0.3. 
According to this assumption, it is then possible to consider whether the 
current candidate is significant or not. Small values of the threshold, such 
as 2 -8 , represent a weakly conservative network. On the other hand, high 
values of the threshold, such as 0.3, represent a strongly conservative net¬ 
work. We first investigated how the network performed for higher values of 
the threshold and we found that the networks were too conservative and, 
consequently, NueNet NUE only found one candidate belonging to the tar¬ 
get series only. The same reasoning holds for the validation threshold that 
gives the range of values within which the validation error can fall. The as¬ 
sumption on how wide the range is, determines whether the network can be 
considered to have undergone enough training. Finally the number of hid¬ 
den nodes ranges between 0.1 and 2.5, with step of 0.2, times the number of 
available past states. In this way, we allow the network to have a number of 
hidden nodes so as to allow it to reach the best performance with increasing 
number of inputs. It turned out that number of hidden nodes = 1.3 x number 
of available candidates was the best compromise. 

For each combination of the parameters we evaluated how many times the 
method was able to detect the right information flows, estimating the num¬ 
ber of true positives (TP), true negatives (TN), false positives (FP) and false 
negatives (FN). We then evaluated sensitivity = TP /(TP+FN), specificity 
= TN / (TN+FP) and Fl S core = 2TP/(2TP 4~ FP 4- FN) and we checked 
how the performances changed as the parameters varied. We found that neu¬ 
ral networks obtained high performances on both systems corresponding to 
different parameters values: parameters values obtained on the linear system 
allowed NeuNet NUE to be less conservative with regard to neural networks 
used with parameters values found on the non-linear system. We finally chose 
the parameters in correspondence to which the network would be considered 
less conservative: th = 8 -3 ; valTli = 0.6; hidNodes = 0.3. Figure [2] shows the 
performances of the network for different values of th, keeping all the other 
parameters fixed. For a better visualization, only five points out of the nine 
showed in the table were plotted. The other four points have been omitted, 
being too close to the others in the figure: the resulting curve is virtually 
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unchanged. We can notice that for the minimum value of th, 2~ 8 , the net¬ 
work takes into account more past states than needed retrieving more FP and 
less TP than for higher values of th. Furthermore, the specificity is lower, 
almost zero, than the sensitivity. For th = 8 -3 the network’s performances 
give sensitivity = specificity = 1, while for the maximum value of th, 0.3, 
the network is not allowed to choose a lot of past states and, consequently, 
there are less TP and more TN than for lower values of th, resulting in the 
sensitivity lower than the specificity. 


Name Parameter 

Parameter values 

th 

2" 8 2 -6 2-4 2-3 5-3 8-3 l” 2 0.15 0.30 

valTli 

0.2 0.4 0.6 0.8 1 

hidNodes 

0.1 0.3 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 

V~ 

0.4 0.5 0.6 0.7 0.8 0.9 


1.1 1.2 1.3 1.4 


Table 1: Parameters values to initialize the network. 

Sensitivity and Specificity vs Threshold values 
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Figure 2: Sensitivity and specificity at varying of the threshold values. 
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6.2 Reasoning behind our discarding of the uniform em¬ 
bedding approach 

Before explaining the experiments that we performed to test the proposed 
approach, we would like to underline that the non-uniform embedding frame¬ 
work was chosen because of its theoretical advantages with respect to the uni¬ 
form embedding. Furthermore, we wanted to check whether those advantages 
held in the case of the GC neural networks estimator too. Neural networks 
used with the uniform embedding approach (NeuNet UE) only need one net¬ 
work to be trained when errf u p is evaluated. Each input of the network 
represents a past state, therefore the number of inputs equals the number of 
available past states. The other network parameters have the same values 
as in the case of NeuNet NUE. The validation phase is still required. Once 
err reduced evaluated by only removing the inputs corresponding to the 
past states that belong to the driver whose influence to a specific target is 
tested, we can obtain a value of NNGC whose significance still has to be 
evaluated. This step is addressed using surrogates technique as implemented 
in the case of other estimators also used into the uniform embedding frame¬ 
work mu- This means that for each surrogate another network with the 
same architecture has to be trained resulting in a dramatic increase of the 
computational complexity. 

We compared NeuNet NUE and NeuNet UE performing a multivariate 
analysis on 100 realizations of a system composed of five coupled Henon maps 
with a length of 2500 time points, built according to the following equations: 

A_i,n = aV ( 1) — ( 0 . 5 c(X 4 i t_i +X 5 i t_i) + 

(1 — c)Xi t _iY + aV (2)W 1)t _ 2 + u>i, n 
X 2 ,t = aV(l) — (0.5c(X3 > t_i + Xs jt ^i)+ 

(1 — c)Xi }t _i) 2 + aV( 2 )W L ,t -2 + W 2 , n 
X 3 , t = aV ( 1 ) - Xl t _, + aV(2)X\ t _ 2 + w 3jU ( } 

A 4 t n — oV (1) — X‘lt_\ + aV ( 2 )AG,t -2 — 0.02cX 3jt -2 + 

X§j. = clV( 1 ) — (0.5 c(Xi tt _i + X 2) t~ 1 ) + 

(1 — c)X 5 ,t-l ) 2 + aV ( 2 )X§j -2 + w 3t n 

where aV is the characteristic parameter tuned for chaotic behavior, the 
coupling strength c = 0.4 and w is drawn from Gaussian noise with zero 
mean and unit variance. In figure [3] the modeled links are shown. 

We performed the analysis setting lx — ly — lz — 5 and using the rest of 
the parameters values found in Subsection 16.11 Looking at sensitivity, speci¬ 
ficity and Fl SCO re ? tabled we can clearly notice that NeuNet NUE performs 
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\ \t 

Figure 3: Simulated system. Interactions between the variables of the simu¬ 
lated Henon maps system generated according to equations (171) . 


better than NeuNet UE. Considering this result, the heavy computational 
complexity and the lack of information about the only past states that can 
give information to the target concerning the use of NeuNet UE, led us to 
only take into account NeuNet NUE for further investigations. 



Sens 

Spec 

Flscore 

NeuNet NUE 

0.86 

0.80 

0.80 

NeuNet UE 

0.69 

0.93 

0.77 


Table 2: Sensitivity, specificity and Fl score values obtained on the system 
([3) by NeuNet NUE and NeuNet UE. 


6.3 Simulated data: Henon maps 

In the first experiment we generated 6 Henon maps rearranging the system 
(I7j) as shown in figure [I] setting the coupling strength c = 0.2. The equations 
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are shown in the following 


Ai, n — aV( 1) — X^ n _ 1 + aV(2)X\ n -2 + wi, n 
X m ,n aV(l) (0.5c m (X m _ lin _! T X m j r \^ n _\)-\- 
T (1 C r n)X rn n — i) T aV ( X)X m n —2 T ^m,n 
X& ,n — aV(l) — Xg n _ 1 + aV(2)X 6t n_2 + Wq )U 

where aV is the vector of parameters for chaos, w is drawn from Gaussian 
noise with zero mean and unit variance and m E [2, 5] is the identifier of the 
series. 



Figure 4: Simulated system. Interactions between the variables of the simu¬ 
lated Henon maps system generated according to equations (151) . 

We generated 100 realizations of the Henon maps, performed a multivari¬ 
ate analysis keeping the parameters found in section 16.11 fixed and setting 
lx = W — lz — 5. We then evaluated the mean values of the NNGC for all 
the pairwise combinations driver-target as shown in figure [5} We compared 
our method’s performance with the binning, linear and nearest neighbor es¬ 
timators implemented in the non-uniform embedding framework (henceforth 
BIN NUE, LIN NUE and NN NUE). These three estimators are already 
implemented in MuTE Pi- The comparison with NeuNet NUE has been 
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performed in terms of sensitivity, specificity and Fl SCO re; as shown in table 
[31 We can notice that NeuNet NUE is the second best method after NN 


NUE. 


NeuNueNUE 



Figure 5: GC matrix representation for the NueNet NUE estimator applied 
to the system (jSj). The color indicates the magnitude of the GC averaged 
over 100 realizations of the simulation. The targets are plotted on the x-axis 
while on the drivers are plotted on the y-axis. 



Sens 

Spec 

Flscore 

BIN NUE 

1 

0.86 

0.84 

LIN NUE 

0.99 

0.74 

0.73 

NeuNet NUE 

1 

0.98 

0.98 

NN NUE 

1 

1 

1 


Table 3: Sensitivity, specificity and Fl score values obtained on the system 
(jgj) by the four estimators. 

Furthermore, we wanted to investigate whether NeuNet NUE was robust 
enough with respect to the coupling strength involved in (JSJ) using only 5 
time series. Again we performed a comparison with the estimators imple¬ 
mented in MuTE. In figures [6] - [9] we can see the performances of the four 
methods, noticing that NeuNet NUE is the only approach able to detect the 
expected information flows even when the coupling is 0.8. NN NUE detects 
two false positive information flows for the directions 4 —» 2, 2 —> 4 from 
coupling strength value equal to 0.6 on. BIN NUE and LIN NUE obtain the 
worst performances detecting false positive information flows even for cou¬ 
pling strength equal to 0.3, see figure [6] direction 2 —)■ 4. Note the differing 
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number of outliers in figure [6] versus figure [HI even if the detection criterion is 
fixed: on each box, the central mark is the median, the edges of the box are 
the 25tli and 75th percentiles, the whiskers extend to the most extreme data 
points not considered outliers. The four methods show different fluctuations 
of the exchanged information values as remarked by the different number of 
outliers. 

In figures [TUI dUwe compare the performances of the four methods show¬ 
ing their ROC curves and Flscore; respectively. NN NUE and NeuNet NUE 
ROC curves report the highest sensitivity and specificity as soon as the cou¬ 
pling is greater than zero. For high couplings the ROC curves denote a higher 
specificity of NeuNet NUE. BIN NUE starts with low sensitivity and speci¬ 
ficity, and its specihcity generally increases as the coupling increases. Fl SCO re 
curves belonging to NeuNet NUE and NN NUE are very close. For couplings 
greater than 0.5 NeuNet NUE Fl S core is higher than NN NUE Flscore; but 
both lower than BIN NUE Flscore denoting once again how NeuNet NUE 
can be much closer to model-free than to model-based approaches. Only at 
coupling = 0.6 NeuNet NUE has the highest Fl score . 
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Figure 6: LIN NUE performances on Henon maps at varying of the coupling 
strength. GC values are plotted on the y-axis, while the coupling strength 
values are plotted on the x-axis. 
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Figure 7: NeuNet NUE performances on Henon maps at varying of the cou¬ 
pling strength. NNGC values are plotted on the y-axis, while the coupling 
strength values are plotted on the x-axis. 


Referring to the equations (I7|) . we obtained the results shown in table @] in 
terms of sensitivity, specificity and Fl SCO re- This time NeuNet NUE detects 
causal influences better than BIN NUE, even if it is still not able to perform 
better than NN NUE. 



Sens 

Spec 

FI score 

BIN NUE 

0.85 

0.68 

0.73 

LIN NUE 

0.88 

0.45 

0.65 

NeuNet NUE 

0.86 

0.80 

0.80 

NN NUE 

0.87 

0.91 

87 


Table 4: Sensitivity, specificity and Fl SCO re values obtained on the system 
(17)1 by the four estimators 
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Figure 8: BIN NUE performances on Henon maps at varying of the coupling 
strength. TE values are plotted on the y-axis, while the coupling strength 
values are plotted on the x-axis. 

6.4 Simulated data: Lorenz system 

In the third experiment we studied a system composed of five identical Lorenz 
subsystems defined by the following equations: 


x\ = — lOaq + 10xi, iy = —10 Xi + 10 ay + C(ay _1 — ay), 

fh = ~xiZ! + 28ay - y u y t = —X& + 28®* - y t , (9) 

zi = xyy x - 8/3^i, Zi = Xiyi-8/3zi, 

where i G [2, 5]. The differential equations are solved by means of the Runge- 
Kutta method implemented in MATLAB and the time series are generated 
at a sampling rate of 0.01 time units. The subsystems, ranging from X\ 
to X 5 , influence each other according the following rule: z-th time series is 
influenced only by the (i — l)-th time series except for Xi that only gives 
influence to X 2 . The coupling strength C = 5 is the same for the whole set 
on influences. 

The nature of the Lorenz system results in a more challenging system 
than the Henon systems. The parameters set up have been kept the same as 
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Figure 9: NN NUE performances on Henon maps at varying of the coupling 
strength. TE values are plotted on the y-axis, while the coupling strength 
values are plotted on the x-axis. 


in the other experiments. Even in this scenario NeuNet NUE can reach good 
performances with both high sensitivity and specihcity, as shown in table [5] 
Our method, on this system too, reaches performances in the middle between 
the model-free approaches and the model-based approach. 



Sens 

Spec 

FI score 

BIN NUE 

0.94 

0.91 

0.82 

LIN NUE 

0.51 

0.72 

0.39 

NeuNet NUE 

0.70 

0.95 

0.74 

NN NUE 

1 

0.86 

0.77 


Table 5: Sensitivity, specihcity and F1,score values obtained on the Lorenz 
system by the four estimators. 

Another experiment on a chaotic Lorenz system was performed in order to 
check how robust NeuNEt NUE could be with respect to influences occurring 
at longer delays. We used 150 bidirectionally coupled Lorenz systems as in 
[20]. The delay at which series 1 influences series 2 was set at 45 points back, 
while the delay at which series 2 influences series 1 was set at 75 points back. 
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Roc curve at varying of the coupling strength 



Figure 10: ROC curves obtained on Henon maps at varying of the coupling 
strength. 


The coupling constant was set as 0.1 for both series. We chose 90 candidates 
for each series and checked how many times each candidate was chosen. As 
we can see in figures fl2lfl5l NN NUE can detect the right delays even if there 
are many other candidates chosen. NeuNet NUE was successful in retrieving 
the correct influences at the corresponding delays, more often than NN NUE 
as it can be seen from the height of the peaks. The other two estimators 
clearly failed in detecting the right influences and delays. 

6.5 Redundant data 

An issue that complicates the correct detection of GC is the presence of 
redundant variables. In this case the conditioning approach is misled and 
the analysis results in false negatives (see [.19] for a complete explanation of 
this phenomenon). We applied neural networks Granger causality analysis 
to redundant data to check whether the approach was able to detect the 
right information flows with an increasing number of redundant variables. 
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FI score at varying of the coupling strength 



Figure 11: Fl SCO re obtained on Henon maps at varying of the coupling 
strength. 

BINNUE candidates series 1 BINNUE candidates series 2 




Figure 12: BIN NUE performances on bidirectionally coupled Lorents sys¬ 
tem. 
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LINNUE candidates series 2 
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Figure 13: LIN NUE performances on bidirectionally coupled Lorents system. 


NEUNETNUE candidates series 1 


NEUNETNUE candidates series 2 




Delay 


Delay 


Figure 14: NeuNet NUE performances on bidirectionally coupled Lorents 
system. 


We used data generated by the following equations: 

t n hn—2 4” 

di t n h n — i 4" Cipn 


( 10 ) 
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Figure 15: NN NUE performances on bidirectionally coupled Lorents system. 


where the process h and the noises e, cp are drawn from a Gaussian distribu¬ 
tion with zero mean and unit variance. The coefficient c modulates the noise. 
The system represent a chain of influences, for which redundancy arises when 
i > 3, (the first two variables share information on the future of the third 
one), and so on. 

We compared NeuNet NUE with the fully conditioned non-linear ker¬ 
nel Granger causality as in |19j . The experiments were performed with 
lx = W = lz — 5 and keeping the parameters found in Subsection 16.11 
fixed. We generated 20 trials of the system (flUl) varying the number of re¬ 
dundant variables from 1 up to 20, with 2500 time points. The analyses were 
performed taking into account the variable t as targets and each variable di 
as driver, conditioning on the remaining o variables, with i £ [1,..., 20]. 
We then evaluated GC for both methods averaging over the number of trials, 
varying the number of redundant variables. According to the results shown 
in figure [T6l we can notice how GC detected by the neural networks never 
drops to zero, as it happens for kernel GC. Table [6] reports the number of 
false negatives given by NeuNet NUE. It is worth noting that the amount of 
false negatives is zero up to 10 redundant variables. Conversely, due to the 
different construction of the method, the values of kernel Granger causality 
are always significant, albeit very low, at least for this system size. 
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#RV 

12 3 

4 

5 

6 

7 

8 

9 

10 1112 13 141516171819 20 

#FN 

0 0 0 

0 

0 

0 

0 

0 

0 

0 3 4 1413 17 29 4149 57 79 


Table 6 : Number of false negatives, FN, returned by NeuNet NUE for 20 
trials at varying of the number of redundant variables, RV. 


Averaged GC vs number of redundant variables 



Figure 16: NeuNet NUE and multivariate GC performances on the redundant 
system. 

6.6 Classification task 

Our final goal in this paper was to test whether NeuNet NUE could correctly 
classify dynamics. We trained NeuNet NUE on the system (jHJ). We have 
randomly chosen one of the networks trained to detect the causal influences 
towards a certain target. Then we fed the network with 100 realizations of 
the system (jSjl . never used in the learning phase, and with 100 realizations 
of an autoregressive system, represented by the following equations: 

Xin = 0.95'\/2 — 0.9025 X ln _2 + Z\ tU 

X2,n — 0.5 X \ n _ 2 + Z 2 ,n 

Xz,n — 0-4 Ai ;n _3 + Z3 jTI (11) 

X4 >n = — 0.5 X^ n _ 2 + 0.25"\/2 + 0.25V2X 5 , n -i + Z4 )U 

Xr^ n = — 0.25\/2 X4 i71 _i + 0.25\/2^5 jn _i + z$ tn 

where z i jn , Z 2 , n , ^ 3 , n , Z 4 , n , Zs, n are drawn from Gaussian noise with zero mean 
and unit variance. 
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System [S] is considered, this time with only five variables to be consistent 
with the size of the autoregressive model. We then evaluated the average 
R.MSE for both the systems. We repeated the procedure for 30 different 
noise values, ranging from 0 to 0.7, and different couplings strength, ranging 
in the interval [0,0.8] with step of 0.2. The results are shown in figure [T71 
We plotted the average R.MSE with respect to the different noise level for 
each coupling strength value. We can notice that the errors obtained when 
the two systems are given to NeuNet NUE as test sets lie in linear separable 
portions of space. This represents an encouraging result as it may be useful 
to classify systems, given that our approach has been trained with a known 
system. 


Noisy data | AR System in blue, Henon Maps in red 



Figure 17: RMSE versus the noise level. The red curves are obtained testing 
NeuNet NUE on the same kind of data with which it was trained. The 
blue curves are obtained testing NeuNet NUE on the system (HU>. Each 
curve represents the network trained to detect the influence towards a specific 
target with a different coupling strength. 


7 Conclusions 

In this paper we have implemented the Granger paradigm for detection of 
dynamical influences in the frame of feed-forward neural networks. The nov¬ 
elty of the present approach arises from the use of non-uniform embedding 
for variable selection and generalization error for the assessment of Granger 
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causality. We have demonstrated the theoretical and experimental advan¬ 
tages of implementing the neural network approach with non-uniform em¬ 
bedding compared to the uniform one. Due to the universal character of 
function approximation of neural networks, the proposed approach is inter¬ 
mediate between the classical Granger linear implementation and the non- 
parametric estimator corresponding to transfer entropy: by means of several 
examples, we have shown that there are situations where our approach out¬ 
performs both approaches. The proposed method differs from the kernel 
Granger causality not only by providing a validation phase, but also by let¬ 
ting the neural networks explore the parameters space and building the best 
model to explain the information transfers among variables. Kernel Granger 
causality, instead, is still a model-based approach, for which the type of ker¬ 
nel and the degree of non-linearity have to be specified beforehand. We would 
like to remark that so far neural networks have been used to detect GC only 
when combined with other estimators. Furthermore the training phase was 
stopped only when a certain number or training epochs was reached. This 
choice seems quite approximate because of the lack of knowledge about the 
exact amount of training epochs needed to both minimize the error function 
and to avoid overfitting the neural networks. Therefore, the validation phase 
is necessary in our opinion to ensure that the network fully explores the pa¬ 
rameters space, converges to a minimum and avoids the risk of overfitting. 
We conclude remarking that other wrappers can be taken into account and 
many deep learning architectures are built from artificial neural networks, 
therefore we expect that further developments of our approach will be the 
implementations of Granger causality both using other feature selection al¬ 
gorithms and in the frame of deep learning EH 
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